NSF PAR Search | NSF Public Access Repository

Task Descriptors Help Transformers Learn Linear Models In-Context

Huang, Ruomin; Ge, Rong (April 2025, ICLR 2025)

Large language models (LLMs) exhibit strong in-context learning (ICL) ability, which allows the model to make predictions on new examples based on the given prompt. Recently, a line of research (Von Oswald et al., 2023; Akyürek et al., 2023; Ahn et al., 2023; Mahankali et al., 2023; Zhang et al., 2024) considered ICL for a simple linear regression setting and showed that the forward pass of Transformers is simulating some variants of gradient descent (GD) algorithms on the in-context examples. In practice, the input prompt usually contains a task descriptor in addition to in-context examples. We investigate how the task description helps ICL in the linear regression setting. Consider a simple setting where the task descriptor describes the mean of input in linear regression. Our results show that gradient flow converges to a global minimum for a linear Transformer. At the global minimum, the Transformer learns to use the task descriptor effectively to improve its performance. Empirically, we verify our results by showing that the weights converge to the predicted global minimum and Transformers indeed perform better with task descriptors.
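The gradient-descent view of ICL referred to in this abstract can be illustrated with a small sketch. It is not the construction analyzed in the paper and it ignores the task descriptor entirely; the function name `gd_step_prediction` and the learning rate `lr` are hypothetical choices made for illustration. It shows only what a single gradient-descent step on the in-context least-squares loss, started from zero weights, predicts for a query input.

```python
def gd_step_prediction(xs, ys, x_query, lr):
    """Predict y for x_query after one gradient-descent step from w = 0.

    Loss: L(w) = 1/(2n) * sum_i (w . x_i - y_i)^2.
    At w = 0 the gradient is -(1/n) * sum_i y_i * x_i, so one step
    gives w = (lr / n) * sum_i y_i * x_i.
    """
    n = len(xs)
    d = len(x_query)
    w = [0.0] * d
    for x, y in zip(xs, ys):
        for j in range(d):
            w[j] += lr * y * x[j] / n
    # The prediction is the inner product of the updated weights and the query.
    return sum(wj * qj for wj, qj in zip(w, x_query))


# In-context examples generated by the (illustrative) weights [2, 3].
examples_x = [[1.0, 0.0], [0.0, 1.0]]
examples_y = [2.0, 3.0]
prediction = gd_step_prediction(examples_x, examples_y, [1.0, 1.0], lr=2.0)
```

In this toy case the inputs are orthonormal, so with `lr=2.0` a single step recovers the generating weights exactly; in general one step gives only an approximation.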
Full Text Available
Reassessing how to compare and improve the calibration of machine learning models

Chidambaram, Muthu; Ge, Rong (April 2025, ICLR 2025)

A machine learning model is calibrated if its predicted probability for an outcome matches the observed frequency for that outcome conditional on the model prediction. This property has become increasingly important as the impact of machine learning models has continued to spread to various domains. As a result, there are now a dizzying number of recent papers on measuring and improving the calibration of (specifically deep learning) models. In this work, we reassess the reporting of calibration metrics in the recent literature. We show that there exist trivial recalibration approaches that can appear seemingly state-of-the-art unless calibration and prediction metrics (i.e. test accuracy) are accompanied by additional generalization metrics such as negative log-likelihood. We then use a calibration-based decomposition of Bregman divergences to develop a new extension to reliability diagrams that jointly visualizes calibration and generalization error, and show how our visualization can be used to detect trade-offs between calibration and generalization. Along the way, we prove novel results regarding the relationship between full calibration error and confidence calibration error for Bregman divergences. We also establish the consistency of the kernel regression estimator for calibration error used in our visualization approach, which generalizes existing consistency results in the literature.
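As a concrete illustration of the calibration definition in this abstract, the sketch below computes the common binned estimate of confidence calibration error (often called ECE). This is an assumption-laden stand-in: the paper uses a kernel regression estimator, not this binned one, and the function name and the default of 10 equal-width bins are choices made here for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned estimate of confidence calibration error.

    confidences: predicted probability of the predicted class, each in [0, 1].
    correct: 1 if the corresponding prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        # Equal-width bins; a confidence of exactly 1.0 goes in the last bin.
        idx = min(int(c * n_bins), n_bins - 1)
        bins[idx].append((c, y))

    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        # Weight each bin's confidence/accuracy gap by its share of samples.
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


# Predictions made at 75% confidence that are right 3 times out of 4.
well_calibrated = expected_calibration_error([0.75] * 4, [1, 1, 1, 0])
# Predictions made at 90% confidence that are right only half the time.
overconfident = expected_calibration_error([0.9, 0.9], [1, 0])
```

The first call yields zero error because stated confidence matches observed accuracy; the second yields a gap of about 0.4. A low value here says nothing about accuracy or likelihood, which is the abstract's reason for reporting generalization metrics alongside calibration.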
Full Text Available
Cavity induced many-body localization

https://doi.org/10.1103/PhysRevB.111.155416

Ge, Rong-Chun; Koshkaki, Saeed Rahmanian; Kolodrubetz, Michael H (April 2025, Physical Review B)

Full Text Available
How does Gradient Descent Learn Features---A Local Analysis for Regularized Two-Layer Neural Networks

Zhou, Mo; Ge, Rong (December 2024, NeurIPS 2024)

The ability to learn useful features is one of the major advantages of neural networks. Although recent works show that neural networks can operate in a neural tangent kernel (NTK) regime that does not allow feature learning, many works also demonstrate the potential for neural networks to go beyond the NTK regime and perform feature learning. Recently, a line of work highlighted the feature learning capabilities of the early stages of gradient-based training. In this paper we consider another mechanism for feature learning via gradient descent through a local convergence analysis. We show that once the loss is below a certain threshold, gradient descent with a carefully regularized objective will capture ground-truth directions. We further strengthen this local convergence analysis by incorporating early-stage feature learning analysis. Our results demonstrate that feature learning not only happens at the initial gradient steps, but can also occur towards the end of training.
Full Text Available
Mean-field analysis for learning subspace-sparse polynomials with Gaussian input

Chen, Ziang; Ge, Rong (December 2024, NeurIPS 2024)

In this work, we study the mean-field flow for learning subspace-sparse polynomials using stochastic gradient descent and two-layer neural networks, where the input distribution is standard Gaussian and the output only depends on the projection of the input onto a low-dimensional subspace. We establish a necessary condition for SGD-learnability, involving both the characteristics of the target function and the expressiveness of the activation function. In addition, we prove that the condition is almost sufficient, in the sense that a condition slightly stronger than the necessary condition can guarantee the exponential decay of the loss functional to zero.
Full Text Available
Linear transformers are versatile in-context learners

Vladymyrov, Max; Oswald, Johannes Von; Sandler, Mark; Ge, Rong (December 2024, NeurIPS 2024)

Recent research has demonstrated that transformers, particularly linear attention models, implicitly execute gradient-descent-like algorithms on data provided in-context during their forward inference step. However, their capability in handling more complex problems remains unexplored. In this paper, we prove that each layer of a linear transformer maintains a weight vector for an implicit linear regression problem and can be interpreted as performing a variant of preconditioned gradient descent. We also investigate the use of linear transformers in a challenging scenario where the training data is corrupted with different levels of noise. Remarkably, we demonstrate that for this problem linear transformers discover an intricate and highly effective optimization algorithm, surpassing or matching in performance many reasonable baselines. We analyze this algorithm and show that it is a novel approach incorporating momentum and adaptive rescaling based on noise levels. Our findings show that even linear transformers possess the surprising ability to discover sophisticated optimization strategies.
Full Text Available
Thermal Behaviors in Liquid Immersion Cooling under Various Workloads: a Case Study

https://doi.org/10.1109/IGSC64514.2024.00030

Randall, Thomas; Cooper, Bennett; Kulshreshtha, Naman; Ge, Rong (November 2024, IEEE)

The growing need for energy-efficient computing has led to many novel system innovations, including liquid immersion cooling. While many myths about the technology have been dispelled, the actual impact of this cooling solution on thermal conditions in real computing scenarios remains under-reported and under-studied. In this work, we collate data from multiple system monitoring tools to perform case-study analyses of the thermal behaviors of immersed hardware, aiming to evaluate the effectiveness of liquid immersion cooling for high-performance and datacenter applications.
Full Text Available
Vendor-neutral and Production-grade Job Power Management in High Performance Computing

https://doi.org/10.1109/SCW63240.2024.00231

Kulshreshtha, Naman; Patki, Tapasya; Garlick, Jim; Grondona, Mark; Ge, Rong (November 2024, IEEE)

Power management and energy efficiency are critical research areas for exascale computing and beyond, necessitating reliable telemetry and control for distributed systems. Despite this need, existing approaches present several limitations precluding their adoption in production. These limitations include, but are not limited to, lack of portability due to vendor-specific and closed-source solutions, lack of support for non-MPI applications, and lack of user-level customization. We present a job-level power management framework based on Flux. We introduce flux-power-monitor and demonstrate its effectiveness on the Lassen (IBM Power AC922) and Tioga (HPE Cray EX235A) systems with a low average overhead of 0.4%. We also present flux-power-manager, where we discuss a proportional sharing policy and introduce a hierarchical FFT-based dynamic power management algorithm (FPP). We demonstrate that FPP reduces energy by 1% compared to proportional sharing, and by 20% compared to the default IBM static power capping policy.
Full Text Available
Shared Virtual Memory: Its Design and Performance Implications for Diverse Applications

https://doi.org/10.1145/3650200.3656608

Cooper, Bennett; Scogland, Thomas RW; Ge, Rong (May 2024, ACM)

Full Text Available
Fine-grain Quantitative Analysis of Demand Paging in Unified Virtual Memory

https://doi.org/10.1145/3632953

Allen, Tyler; Cooper, Bennett; Ge, Rong (March 2024, ACM Transactions on Architecture and Code Optimization)

The abstraction of a shared memory space over separate CPU and GPU memory domains has eased the burden of portability for many HPC codebases. However, users pay for ease of use provided by system-managed memory with a moderate-to-high performance overhead. NVIDIA Unified Virtual Memory (UVM) is currently the primary real-world implementation of such abstraction and offers a functionally equivalent testbed for in-depth performance study for both UVM and future Linux Heterogeneous Memory Management (HMM) compatible systems. The continued advocacy for UVM and HMM motivates improvement of the underlying system. We focus on UVM-based systems and investigate the root causes of UVM overhead, a non-trivial task due to complex interactions of multiple hardware and software constituents and the desired cost granularity. In our prior work, we delved deeply into UVM system architecture and showed internal behaviors of page fault servicing in batches. We provided quantitative evaluation of batch handling for various applications under different scenarios, including prefetching and oversubscription. We revealed that the driver workload depends on the interactions among application access patterns, GPU hardware constraints, and host OS components. Host OS components have significant overhead present across implementations, warranting close attention. This extension furthers our prior study in three aspects: fine-grain cost analysis and breakdown, extension to multiple GPUs, and investigation of platforms with different GPU-GPU interconnects. We take a top-down approach to quantitative batch analysis and uncover how constituent component costs accumulate and overlap, governed by synchronous and asynchronous operations. Our multi-GPU analysis shows reduced cost of GPU-GPU batch workloads compared to CPU-GPU workloads. 
We further demonstrate that while specialized interconnects such as NVLink can improve batch cost, their benefits are limited by host OS software overhead and GPU oversubscription. This study serves as a proxy for future shared memory systems, such as those that interface with HMM, and the development of interconnects.
Full Text Available